Human closest gene
Closest gene
Some analyses performed on sequencing data use genes as the basis for analysis. In RNA-seq for example, a table of count data is often produced where a value for a gene is calculated by counting reads that fall within or overlap the gene. Some analyses are not directly based around genes. A common method used in ChIP-seq analysis is peak calling, where a test sample is generally compared to a background sample to identify regions that have a high read count in the test sample. These peaks may be exonic or intronic and can simply be annotated with the overlapping gene. The peaks may also be intergenic in which case they are often annotated with the closest gene.
[insert picture/diagram of exonic, intronic and intergenic peaks]
Genes vary greatly in size and the spread of genes is not uniform across chromosomes. It may be expected that longer genes, genes in the middle of gene deserts and those at the ends of clusters of genes are identified more frequently when annotating data with the closest gene to a region of interest.
The aim of this study was to see whether any biases were found when annotating random positions within the genome with the closest gene. This was carried out using the following steps:
- generating random locations within the mouse genome and annotating these with the closest gene
- seeing whether any particular genes were overrepresented
- performing GO analyses on the lists of closest genes
Generating genes by random position
The python script generate_genelist_from_random_positions.py produces a list of 200 genes. It generates a number of random positions along each chromosome. The number of positions generated is proportional to the length of the chromosome. The chromosome lengths are taken from a separate file (chr_list.txt) A gtf file containing all the genes in the genome is required. Each line from the gtf file is parsed and all the genes are stored by chromosome. Each random position is looped through to find the closest (or overlapping) gene and the output is written to a new file. This script has been run to create lists of genes for the mouse genome using the gtf file for genome version GRCm38.95
for i in {1..200}; do sbatch --mem 10G ./random_pos_wrapper.sh $i; done
The files produced have quite a bit of information (gene_id gene_name chromosome start end biotype distance random_pos)
For now, we want to extract the gene names and remove header for i in closest*; do cut -f2 $i | tail -n 200 > ${i%".txt"}_just_genes.txt; done
Checking for overrepresentation at the gene level
There are 200 sets of 200 genes. We can plot out which genes appear the most often in the 200 sets. The top hit, Gm20388, appears 69 times in mouse. (Does each gene only appear once per list? I don’t think I filter on that.)
In human, first gene is not really relevant as, in contrast to mouse, most chromosomes have a gene quite near the start. What might be useful is to label genes closest to the centromeric regions, though I don’t think that this is worth doing across all the chromosomes as some gene-less centromere regions are quite small
There are a few genes without lengths - presumably these were not found in the genfo file - though the genfo file should contain the same genes as the gtf that these gene lists were derived from (with minor differences in genome version - maybe the genfo file should be reprocessed)
[1] 16503
[1] 299
I’ll carry on for now - but this should be checked out.
Plot number of times a gene appeared vs its length
Genes on the y chromosomes mess this up a bit
Gene ontology analysis
Each of the 200 sets of genes was run through a gene ontology analysis
The number of gene sets that returned significant results from the GO overrepresentation analysis
In this set, we had 200 sets of gene lists, while in the others we had 100 per category. The threshold in the mouse data was therefore doubled. If we do this for human, we don’t get any results, so I’m reducing the threshold to 10. I think that the cutoffs for mouse are probably too stringent, it was just that I was getting so many results that I set the stringency quite high.
Each line on the beanplot shows the number of significant (corrected p-value < 0.05) functional categories returned from gene ontology analysis.